Gap statistics for whole genome shotgun DNA sequencing projects

Authors

  • Michael C. Wendl
  • Shiaw-Pyng Yang

Abstract

MOTIVATION: Investigators use gap estimates when planning DNA sequencing projects. Standard theories assume sequences are independently and identically distributed, which leads to appreciable under-prediction of gaps.

RESULTS: Using a statistical scaling factor and data from 20 representative whole genome shotgun projects, we construct regression equations that relate coverage to a normalized gap measure. For prokaryotic genomes the gap measure does not correlate with sequence coverage, while eukaryotes show strong correlation if the chaff is ignored. Gaps decrease at an exponential rate of only about one-third of that predicted via theory alone. Case studies suggest that departure from theory can largely be attributed to assembly difficulties for repeat-rich genomes, but bias and coverage anomalies are also important when repeats are sparse. Such factors cannot be readily characterized a priori, suggesting upper limits on the accuracy of gap prediction. We also find that the diminishing coverage probability discussed in other studies is a theoretical artifact that does not arise for the typical project.
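The abstract does not give the regression equations themselves, so the following is only a rough sketch of what the "one-third" rate implies. It assumes that the standard theory is of the Lander-Waterman type, in which the expected number of gaps per read falls as exp(-c) at coverage c, and that the empirical trend can be written as exp(-c/3) with no other fitted constants. Both function names and the default `rate_ratio` are hypothetical and chosen for illustration, not taken from the paper.

```python
import math


def theoretical_gap_fraction(coverage: float) -> float:
    """Gaps per read under the assumed i.i.d. theory: exp(-c)."""
    return math.exp(-coverage)


def empirical_gap_fraction(coverage: float, rate_ratio: float = 1.0 / 3.0) -> float:
    """Gaps per read if the exponential decay rate is scaled by rate_ratio.

    rate_ratio = 1/3 mirrors the abstract's "about one-third" statement;
    any intercept from the actual regression is unknown here and omitted.
    """
    return math.exp(-rate_ratio * coverage)


if __name__ == "__main__":
    # Compare the two decay curves at a few coverage depths.
    for c in (2, 4, 6, 8):
        theory = theoretical_gap_fraction(c)
        observed = empirical_gap_fraction(c)
        print(f"coverage {c}x: theory {theory:.4f}, one-third rate {observed:.4f}")
```

Under these assumptions, a project would need roughly three times the coverage to reach the gap level that the theory predicts, which is one way to read the under-prediction described in the abstract.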

Similar resources

Efficiency and Accuracy of Phylogenetic Trees for Large Sequence Datasets

Since the early 1990s, the number of sequences in the GenBank database has increased exponentially, and as of February 15, 2007 had reached more than 67 million sequences, with an average size of more than 1000 base pairs (NCBI 2007). This incredible explosion of data acquisition has been fueled by the application of polymerase chain reaction (PCR) technology to environmental samples and the in...

PCR-assisted contig extension: stepwise strategy for bacterial genome closure.

Finishing is rate limiting for genome projects, and improvements in the efficiency of complete genome-sequence compilation will require improved protocols for gap closure. Here we report a novel approach for extending shotgun contigs and closing gaps that we termed PCR-assisted contig extension (PACE). PACE depends on the capture of rare mismatched interactions that occur between arbitrary prim...

Organization and Evolution of Primate Centromeric DNA from Whole-Genome Shotgun Sequence Data

The major DNA constituent of primate centromeres is alpha satellite DNA. As much as 2%-5% of sequence generated as part of primate genome sequencing projects consists of this material, which is fragmented or not assembled as part of published genome sequences due to its highly repetitive nature. Here, we develop computational methods to rapidly recover and categorize alpha-satellite sequences f...

An Improved Algorithm for Error Correction of Reads in DNA Fragment Assembly

Most large-scale genome sequencing projects use the whole-genome shotgun sequencing strategy, in which a genome is shattered into numerous small fragments and the fragments are then sequenced from both ends. The resulting sequences (called fragments or reads) must then be assembled to reconstruct the chromosomes of the genome. Current technology produces reads of length 600-800 base pairs (bp) ...

Genesis of gene structures and computational analysis of U12-type introns

Background: Whole genome shotgun sequencing produces increasingly higher coverage of a genome with random sequence reads. Progressive whole genome assembly and eventual finishing sequencing is a process that typically takes several years for large eukaryotic genomes. In the interim, all sequence reads of public sequencing projects are made available in repositories such as the NCBI Trace Archiv...

Journal:
  • Bioinformatics

Volume 20, Issue 10

Publication date: 2004